17 research outputs found

    Self-management for large-scale distributed systems

    Get PDF
    Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by management complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the increasing complexity of computing systems and their management. In the first part, we present our platform, called Niche, for programming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achieving self-management in a dynamic environment characterized by volatile resources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our research on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and expressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a distributed self-managing application. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. In the second part, we discuss robustness of management and data consistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. Our approach is based on replicating a management element using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store supporting multiple consistency levels that is based on a peer-to-peer network. The store enables the tradeoff between high availability and data consistency. Using majority allows avoiding potential drawbacks of a master-based consistency control, namely, a single-point of failure and a potential performance bottleneck. In the third part, we investigate self-management for Cloud-based storage systems with the focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a State-Space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using state-space model that enables to trade-off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control

    Achieving Robust Self-Management for Large-Scale Distributed Applications

    Get PDF
    Autonomic managers are the main architectural building blocks for constructing self-management capabilities of computing systems and applications. One of the major challenges in developing self-managing applications is robustness of management elements which form autonomic managers. We believe that transparent handling of the effects of resource churn (joins/leaves/failures) on management should be an essential feature of a platform for self-managing large-scale dynamic distributed applications, because it facilitates the development of robust autonomic managers and hence improves robustness of self-managing applications. This feature can be achieved by providing a robust management element abstraction that hides churn from the programmer. In this paper, we present a generic approach to achieve robust services that is based on finite state machine replication with dynamic reconfiguration of replica sets. We contribute a decentralized algorithm that maintains the set of nodes hosting service replicas in the presence of churn. We use this approach to implement robust management elements as robust services that can operate despite of churn. Our proposed decentralized algorithm uses peer-to-peer replica placement schemes to automate replicated state machine migration in order to tolerate churn. Our algorithm exploits lookup and failure detection facilities of a structured overlay network for managing the set of active replicas. Using the proposed approach, we can achieve a long running and highly available service, without human intervention, in the presence of resource churn. In order to validate and evaluate our approach, we have implemented a prototype that includes the proposed algorithm

    Self-Management for Large-Scale Distributed Systems

    No full text
    Autonomic computing aims at making computing systems self-managing by using autonomic managers in order to reduce obstacles caused by management complexity. This thesis presents results of research on self-management for large-scale distributed systems. This research was motivated by the increasing complexity of computing systems and their management. In the first part, we present our platform, called Niche, for programming self-managing component-based distributed applications. In our work on Niche, we have faced and addressed the following four challenges in achieving self-management in a dynamic environment characterized by volatile resources and high churn: resource discovery, robust and efficient sensing and actuation, management bottleneck, and scale. We present results of our research on addressing the above challenges. Niche implements the autonomic computing architecture, proposed by IBM, in a fully decentralized way. Niche supports a network-transparent view of the system architecture simplifying the design of distributed self-management. Niche provides a concise and expressive API for self-management. The implementation of the platform relies on the scalability and robustness of structured overlay networks. We proceed by presenting a methodology for designing the management part of a distributed self-managing application. We define design steps that include partitioning of management functions and orchestration of multiple autonomic managers. In the second part, we discuss robustness of management and data consistency, which are necessary in a distributed system. Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of Robust Management Elements, which are able to heal themselves under continuous churn. Our approach is based on replicating a management element using finite state machine replication with a reconfigurable replica set. Our algorithm automates the reconfiguration (migration) of the replica set in order to tolerate continuous churn. For data consistency, we propose a majority-based distributed key-value store supporting multiple consistency levels that is based on a peer-to-peer network. The store enables the tradeoff between high availability and data consistency. Using majority allows avoiding potential drawbacks of a master-based consistency control, namely, a single-point of failure and a potential performance bottleneck. In the third part, we investigate self-management for Cloud-based storage systems with the focus on elasticity control using elements of control theory and machine learning. We have conducted research on a number of different designs of an elasticity controller, including a State-Space feedback controller and a controller that combines feedback and feedforward control. We describe our experience in designing an elasticity controller for a Cloud-based key-value store using state-space model that enables to trade-off performance for cost. We describe the steps in designing an elasticity controller. We continue by presenting the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores that combines feedforward and feedback control.QC 20120831</p

    Enabling and Achieving Self-Management for Large Scale Distributed Systems : Platform and Design Methodology for Self-Management

    No full text
    Autonomic computing is a paradigm that aims at reducing administrative overhead by using autonomic managers to make applications self-managing. To better deal with large-scale dynamic environments; and to improve scalability, robustness, and performance; we advocate for distribution of management functions among several cooperative autonomic managers that coordinate their activities in order to achieve management objectives. Programming autonomic management in turn requires programming environment support and higher level abstractions to become feasible. In this thesis we present an introductory part and a number of papers that summaries our work in the area of autonomic computing. We focus on enabling and achieving self-management for large scale and/or dynamic distributed applications. We start by presenting our platform, called Niche, for programming self-managing component-based distributed applications. Niche supports a network-transparent view of system architecture simplifying designing application self-* code.  Niche provides a concise and expressive API for self-* code. The implementation of the framework relies on scalability and robustness of structured overlay networks. We have also developed a distributed file storage service, called YASS, to illustrate and evaluate Niche. After introducing Niche we proceed by presenting a methodology and design space for designing the management part of a distributed self-managing application in a distributed manner. We define design steps, that includes partitioning of management functions and orchestration of multiple autonomic managers. We illustrate the proposed design methodology by applying it to the design and development of an improved version of our distributed storage service YASS as a case study. We continue by presenting a generic policy-based management framework which has been integrated into Niche. Policies are sets of rules that govern the system behaviors and reflect the business goals or system management objectives. The policy based management is introduced to simplify the management and reduce the overhead, by setting up policies to govern system behaviors. A prototype of the framework is presented and two generic policy languages (policy engines and corresponding APIs), namely SPL and XACML, are evaluated using our self-managing file storage application YASS as a case study. Finally, we present a generic approach to achieve robust services that is based on finite state machine replication with dynamic reconfiguration of replica sets. We contribute a decentralized algorithm that maintains the set of resource hosting service replicas in the presence of churn. We use this approach to implement robust management elements as robust services that can operate despite of churn.  QC 2010052

    ElastMan : Autonomic Elasticity Manager for Cloud-Based Key-Value Stores

    No full text
    The increasing spread of elastic Cloud services, together with the pay-asyou-go pricing model of Cloud computing, has led to the need of an elasticity controller. The controller automatically resizes an elastic service, in response to changes in workload, in order to meet Service Level Objectives (SLOs) at a reduced cost. However, variable performance of Cloud virtual machines and nonlinearities in Cloud services, such as the diminishing reward of adding a service instance with increasing the scale, complicates the controller design. We present the design and evaluation of ElastMan, an elasticity controller for Cloud-based elastic key-value stores. ElastMan combines feedforward and feedback control. Feedforward control is used to respond to spikes in the workload by quickly resizing the service to meet SLOs at a minimal cost. Feedback control is used to correct modeling errors and to handle diurnal workload. To address nonlinearities, our design of ElastMan leverages the near-linear scalability of elastic Cloud services in order to build a scale-independent model of the service. Our design based on combining feedforward and feedback control allows to efficiently handle both diurnal and rapid changes in workload in order to meet SLOs at a minimal cost. Our evaluation shows the feasibility of our approach to automation of Cloud service elasticity.QC 20120831</p

    Optimisation-based approaches for data analysis

    No full text
    Recent advances in science and technology promote the generation of a huge amount of data from various sources including scientific experiments, social surveys and practical observations. The availability of powerful computer hardware and softyvare offers easier ways to store datasets. However, more efficient and accurate methodologies are required to analyse datasets and extract useful information from them. This work aims at applying mathematical programming and optimisation methodologies to analyse different forms of datasets. The research focuses on three areas including data . classification, community structure identification of complex networks and DNA motif discovery. Firstly, a general data classification problem is investigated. A mixed integer optfmisation~based approach is proposed to reveal the patterns hidden behind training data samples using a hyper-box representation. An efficient solution methodology is then developed to extend the applicability of hyper-box classifiers to datasets with many training samples and complex structures. . Secondly, the network community structure identification problem is addressed. The proposed mathematical model finds optimal modular structures of complex networks through the maximisation of network modularity metric. Communities of mediumllarge networks are identified through a two-stage solution algorithm developed in this thesis. Finally, the third part presents an optimisation-based framework to extract DNA motifs and consensus sequences. The problem is formulated as a mixed integer linear programming model and an iterative solution procedure is developed to identify multiple motifs in each DNA sequence. The flexibility of the proposed motif finding approach is then demonstrated to incorporate other biological features.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    BwMan: Bandwidth Manager for Elastic Services in the Cloud

    No full text
    The flexibility of Cloud computing allows elastic services to adapt to changes in workload patterns in order to achieve desired Service Level Objectives (SLOs) at a reduced cost. Typically, the service adapts to changes in workload by adding or removing service instances (VMs), which for stateful services will require moving data among instances. The SLOs of a distributed Cloud-based service are sensitive to the available network bandwidth, which is usually shared by multiple activities in a single service without being explicitly allocated and managed as a resource. We present the design and evaluation of BwMan, a network bandwidth manager for elastic services in the Cloud. BwMan predicts and performs the bandwidth allocation and tradeoffs between multiple service activities in order to meet service specific SLOs and policies. To make management decisions, BwMan uses statistical machine learning (SML) to build predictive models. This allows BwMan to arbitrate and allocate bandwidth dynamically among different activities to satisfy specified SLOs. We have implemented and evaluated BwMan for the OpenStack Swift store. Our evaluation shows the feasibility and effectiveness of our approach to bandwidth management in an elastic service. The experiments show that network bandwidth management by BwMan can reduce SLO violations in Swift by a factor of two or more.Peer ReviewedPostprint (published version

    OnlineElastMan : self-trained proactive elasticity manager for cloud-based storage services

    No full text
    The pay-as-you-go pricing model and the illusion of unlimited resources in the Cloud initiate the idea to provision services elastically. Elastic provisioning of services allocates/de-allocates resources dynamically in response to the changes of the workload. It minimizes the service provisioning cost while maintaining the desired service level objectives (SLOs). Model-predictive control is often used in building such elasticity controllers that dynamically provision resources. However, they need to be trained, either online or offline, before making accurate scaling decisions. The training process involves tedious and significant amount of work as well as some expertise, especially when the model has many dimensions and the training granularity is fine, which is proved to be essential in order to build an accurate elasticity controller. In this paper, we present OnlineElastMan, which is a self-trained proactive elasticity manager for cloud-based storage services. It automatically evolves itself while serving the workload. Experiments using OnlineElastMan with Cassandra indicate that OnlineElastMan continuously improves its provision accuracy, i.e., minimizing provisioning cost and SLO violations, under various workload patterns

    Self-Management with Niche, a Platform for Self-Managing Distributed Applications

    No full text
    We present Niche, a general-purpose distributed component management system used to develop, deploy and execute self-managing distributed applications. Niche consists of both a component-based programming model as well as a distributed runtime environment. It is especially designed for complex distributed applications that run and manage themselves in dynamic and volatile environments. Self-management in dynamic environments is challenging due to the high rate of system or environmental changes and the corresponding need to frequently reconfigure, heal, and tune the application. The challenges are met partly by making use of an underlying overlay in the platform to provide an efficient, location-independent, and robust sensing and actuation infrastructure, and partly by allowing for maximum decentralization of management. We describe the overlay services, the execution environment, showing how the challenges in dynamic environments are met. We also describe the programming model and a high-level design methodology for developing decentralized management, illustrated by two application case studies

    SpanEdge: Towards Unifying Stream Processing over Central and Near-the-Edge Data Centers

    No full text
    In stream processing, data is streamed as a continuous flow of data items, which are generated from multiple sources and geographical locations. The common approach for stream processing is to transfer raw data streams to a central data center that entails communication over the wide-area network (WAN). However, this approach is inefficient and falls short for two main reasons: i) the burst in the amount of data generated at the network edge by an increasing number of connected devices, ii) the emergence of applications with predictable and low latency requirements. In this paper, we propose SpanEdge, a novel approach that unifies stream processing across a geodistributed infrastructure, including the central and near-theedge data centers. SpanEdge reduces or eliminates the latency incurred by WAN links by distributing stream processing applications across the central and the near-the-edge data centers. Furthermore, SpanEdge provides a programming environment, which allows programmers to specify parts of their applications that need to be close to the data source. Programmers can develop a stream processing application, regardless of the number of data sources and their geographical distributions. As a proof of concept, we implemented and evaluated a prototype of SpanEdge. Our results show that SpanEdge can optimally deploy the stream processing applications in a geo-distributed infrastructure, which significantly reduces the bandwidth consumption and the response latency.QC 20161005</p
    corecore